In this section, we will focus on comparing the expression levels of genes across different samples. This analysis sets aside the task of estimating the different kinds of RNA molecules, and the different isoforms for genes with multiple isoforms. One advantage of looking at these matrices of counts is that we can use statistical distributions to model how the variance of counts will change when the counts are low vs high. We will explore the relationship of the variance of counts to the mean later in this lab.
In this section we will examine 75 samples from Waldenstrom Macroglobulinemia patients, in a cohort called ???zhunter,??? from Harvard University.
setwd("/Users/yah2014/Dropbox/Public/Olivier/R/zhunter/Mutation");getwd() #set_up_enviroment
## [1] "/Users/yah2014/Dropbox/Public/Olivier/R/zhunter/Mutation"
sample_zhunter.table <- read.csv("sample_table_zhunter.csv")
fileName <-paste0(sample_zhunter.table$BioSample, ".bam.count")
sample_zhunter.table <- data.frame(sampleName = sample_zhunter.table$sampleName,
fileName = fileName,
Condition= sample_zhunter.table$Condition,
QC = sample_zhunter.table$QC,
Label= sample_zhunter.table$Label,
BioSample = sample_zhunter.table$BioSample)
head(sample_zhunter.table)
## sampleName fileName Condition QC
## 1 ZH10_NWM07_CTTGTA ZH10_NWM07_CTTGTA_L005_R1_001.bam.count Cancer PASS
## 2 ZH10_rnaWT10 ZH10_rnaWT10_R1_.tmp.bam.count Healthy PASS
## 3 ZH11_NWM09_ACAGTG ZH11_NWM09_ACAGTG_L005_R1_001.bam.count Cancer PASS
## 4 ZH11_rnaWT11 ZH11_rnaWT11_R1_.tmp.bam.count Healthy PASS
## 5 ZH12_NWM10_GTGAAA ZH12_NWM10_GTGAAA_L006_R1_001.bam.count Cancer PASS
## 6 ZH12_rnaWT12 ZH12_rnaWT12_R1_.tmp.bam.count Healthy PASS
## Label BioSample
## 1 NWM ZH10_NWM07_CTTGTA_L005_R1_001
## 2 rnaWT ZH10_rnaWT10_R1_.tmp
## 3 NWM ZH11_NWM09_ACAGTG_L005_R1_001
## 4 rnaWT ZH11_rnaWT11_R1_.tmp
## 5 NWM ZH12_NWM10_GTGAAA_L006_R1_001
## 6 rnaWT ZH12_rnaWT12_R1_.tmp
invisible(library(DESeq2))
directory <- "/Users/yah2014/Dropbox/Public/Olivier/R/ALL_COUNTS/zhunter_Counts" #dir for HTSeq Counts
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sample_zhunter.table,
directory = directory,
design= ~ Condition)
ddsHTSeq
## class: DESeqDataSet
## dim: 23718 75
## metadata(1): version
## assays(1): counts
## rownames(23718): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(75): ZH10_NWM07_CTTGTA ZH10_rnaWT10 ... ZH9_NWM06_GCCAAT
## ZH9_rnaWT9
## colData names(4): Condition QC Label BioSample
if Error in colnames<-(*tmp*, value = 1:75) : attempt to set ‘colnames’ on an object with less than two dimensions clean the first row with the empty first column and “0” in the second column in HTSeq.count files use terminal, run below command lines to remove first line for all files: cd /Users/yah2014/Dropbox/Public/Olivier/R/ALL_COUNTS/zhunter_Counts for file in \((ls);do echo "\)(tail -n +2 $file)" > $file;done
removing rows in which there are no reads or nearly no reads
ddsHTSeq <- ddsHTSeq[ rowSums(counts(ddsHTSeq)) > 1, ]
Note on factor levels
ddsHTSeq$Condition <- factor(ddsHTSeq$Condition, levels=c("Cancer","Healthy","Unclear"))
ddsHTSeq$Label <- factor(ddsHTSeq$Label, levels=c("NWM","rnaWT","WM","MWCL","BCWM"))
The following estimates size factors to account for differences in sequencing depth, and is only necessary to make the log.norm.counts object below.
dds <- estimateSizeFactors(ddsHTSeq)
head(sizeFactors(dds))
## ZH10_NWM07_CTTGTA ZH10_rnaWT10 ZH11_NWM09_ACAGTG ZH11_rnaWT11
## 1.1600472 0.5454389 1.5941689 0.4574318
## ZH12_NWM10_GTGAAA ZH12_rnaWT12
## 2.0997319 0.2261451
head(colSums(counts(dds)))
## ZH10_NWM07_CTTGTA ZH10_rnaWT10 ZH11_NWM09_ACAGTG ZH11_rnaWT11
## 46912880 17611380 45321999 15580851
## ZH12_NWM10_GTGAAA ZH12_rnaWT12
## 58828636 6868649
plot(sizeFactors(dds), colSums(counts(dds)))
abline(lm(colSums(counts(dds)) ~ sizeFactors(dds) + 0))
Size factors are calculated by the median ratio of samples to a pseudo-sample (the geometric mean of all samples). In other words, for each sample, we take the exponent of the median of the log ratios in this histogram.
loggeomeans <- rowMeans(log(counts(dds)))
hist(log(counts(dds)[,1]) - loggeomeans,
col="grey", main="", xlab="", breaks=40)
The size factor for the first sample:
exp(median((log(counts(dds)[,1]) - loggeomeans)[is.finite(loggeomeans)]))
## [1] 1.160047
sizeFactors(dds)[1]
## ZH10_NWM07_CTTGTA
## 1.160047
Make a matrix of log normalized counts (plus a pseudocount):
log.norm.counts <- log2(counts(dds, normalized=TRUE) + 1)
Examine the log normalized counts (plus a pseudocount).
rs <- rowSums(counts(dds))
boxplot(log.norm.counts[rs > 0,]) # normalized
Make a scatterplot of log normalized counts against each other. Note the fanning out of the points in the lower left corner, for points less than \(2^5 = 32\).
plot(log.norm.counts[,1:2], cex=.1)
Now we will use a more sophisticated transformation, which is similar to the variance stablizing normalization method taught in Week 3 of Course 4: Introduction to Bioconductor. It uses the variance model for count data to shrink together the log-transformed counts for genes with very low counts. For genes with medium and high counts, the rlog is very close to log2. For further details, see the section in the DESeq2 paper.
This will take five minutes!!
#rld <- rlog(dds)
#plot(assay(rld)[,1:2], cex=.1)
Another transformation for stabilizing variance in the DESeq2 package is varianceStabilizingTransformation. These two tranformations are similar, the rlog might perform a bit better when the size factors vary widely, and the varianceStabilizingTransformation is much faster when there are many samples.
vsd <- varianceStabilizingTransformation(dds)
plot(assay(vsd)[,1:2], cex=.1)
We can examine the standard deviation of rows over the mean for the vsd. Note that the genes with high variance for the log come from the genes with lowest mean. If these genes were included in a distance calculation, the high variance at the low count range might overwhelm the signal at the higher count range.
library(vsn)
meanSdPlot(assay(vsd), ranks=FALSE)
The principal components (PCA) plot is a useful diagnostic for examining relationships between samples:
Using the VST:
plotPCA(vsd, intgroup="Condition")
We can make this plot even nicer using custom code from the ggplot2 library:
library(ggplot2)
head((data <- plotPCA(vsd, intgroup=c("Condition","Label"), returnData=TRUE)))
## PC1 PC2 group Condition Label
## ZH10_NWM07_CTTGTA -4.403314 13.418230 Cancer : NWM Cancer NWM
## ZH10_rnaWT10 -17.049619 -9.921083 Healthy : rnaWT Healthy rnaWT
## ZH11_NWM09_ACAGTG 8.591741 8.930622 Cancer : NWM Cancer NWM
## ZH11_rnaWT11 17.688041 -27.323018 Healthy : rnaWT Healthy rnaWT
## ZH12_NWM10_GTGAAA 19.544231 -9.537662 Cancer : NWM Cancer NWM
## ZH12_rnaWT12 1.150560 -19.160230 Healthy : rnaWT Healthy rnaWT
## name
## ZH10_NWM07_CTTGTA ZH10_NWM07_CTTGTA
## ZH10_rnaWT10 ZH10_rnaWT10
## ZH11_NWM09_ACAGTG ZH11_NWM09_ACAGTG
## ZH11_rnaWT11 ZH11_rnaWT11
## ZH12_NWM10_GTGAAA ZH12_NWM10_GTGAAA
## ZH12_rnaWT12 ZH12_rnaWT12
(percentVar <- 100*round(attr(data, "percentVar"),2))
## [1] 24 15
makeLab <- function(x,pc) paste0("PC",pc,": ",x,"% variance")
ggplot(data, aes(PC1,PC2,col=Condition,shape=Label)) + geom_point() +
xlab(makeLab(percentVar[1],1)) + ylab(makeLab(percentVar[2],2))
In addition, we can plot a hierarchical clustering based on Euclidean distance matrix:
plot(hclust(dist(t(log.norm.counts))), labels=colData(dds)$Condition,cex = 0.75)
plot(hclust(dist(t(assay(vsd)))), labels=colData(vsd)$Condition,cex = 0.75)
We will now perform differential gene expression on the counts, to try to find genes in which the differences in expected counts across samples due to the condition of interest rises above the biological and technical variance we observe.
We will use an overdispersed Poisson distribution – called the negative binomial – to model the raw counts in the count matrix. The model will include the size factors into account to adjust for sequencing depth. The formula will look like:
\[ K_{ij} \sim \text{NB}(s_{ij} q_{ij}, \alpha_i ) \]
where \(K_{ij}\) is a single raw count in our count table, \(s_{ij}\) is a size factor or more generally a normalization factor, \(q_{ij}\) is proportional to gene expression (what we want to model with our design variables), and \(\alpha_i\) is a dispersion parameter.
Why bother modeling raw counts, rather than dividing out the sequencing depth and working with the normalized counts? In other words, why put the \(s_{ij}\) on the right side of the equation above, rather than dividing out on the left side and modeling \(K_{ij} / s_{ij}\). The reason is that, with the raw count, we have knowledge about the link between the expected value and its variance. So we prefer the first equation below to the second equation, because with the first equation, we have some additional information about the variance of the quantity on the left hand side.
\[ K_{ij} \sim \text{NB}(\mu_{ij} = s_{ij} q_{ij} ) \]
\[ \frac{K_{ij}}{s_{ij}} \sim \mathcal{L}(\mu_{ij} = q_{ij}) \]
When we sample cDNA fragments from a pool in a sequencing library, we can model the count of cDNA fragments which originated from a given gene with a binomial distribution, with a certain probability of picking a fragment for that gene which relates to factors such as the expression of that gene (the abundance of mRNA in the original population of cells), its length and technical factors in the production of the library. When we have many genes, and the rate for each gene is low, while the total number of fragments is high, we know that the Poisson is a good model for the binomial. And for the binomial and the Poisson, there is an explicit link between on observed count and its expected variance.
A number of methods for assessing differential gene expression from RNA-seq counts use the negative binomial distribution to make probabilistic statements about the differences seen in an experiment. A few such methods are edgeR, DESeq2, and DSS. Other methods, such as limma+voom find other ways to explicitly model the mean of log counts and the observed variance of log counts. A very incomplete list of statistical methods for RNA-seq differential expression is provided in the footnotes.
DESeq2 performs a similar step to limma as discussed in PH525x Course 3, in using the variance of all the genes to improve the variance estimate for each individual gene. In addition, DESeq2 shrinks the unreliable fold changes from genes with low counts, which will be seen in the resulting MA-plot.
Remember, we had created the DESeqDataSet object earlier using the following line of code (or alternatively using DESeqDataSetFromMatrix)
dds <- DESeqDataSet(ddsHTSeq, design= ~ Condition)
First, we setup the design of the experiment, so that differences will be considered across time and protocol variables. We can read and if necessary reset the design using the following code.
design(dds)
## ~Condition
design(dds) <- ~Condition
The last variable in the design is used by default for building results tables (although arguments to results can be used to customize the results table), and we make sure the “control” or “untreated” level is the first level, such that log fold changes will be treated over control, and not control over treated.
levels(dds$Condition)
## [1] "Cancer" "Healthy" "Unclear"
dds$Condition <- relevel(dds$Condition, "Unclear")
dds$Condition <- relevel(dds$Condition, "Healthy")
levels(dds$Condition)
## [1] "Healthy" "Unclear" "Cancer"
The following line runs the DESeq2 model. After this step, we can build a results table, which by default will compare the levels in the last variable in the design, so the Condition treatment in our case:
dds <- DESeq(dds)
res <- results(dds)
head(res)
## log2 fold change (MAP): Condition Cancer vs Healthy
## Wald test p-value: Condition Cancer vs Healthy
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## A1BG 310.314358 -1.1415070 0.2167890 -5.265521 1.397924e-07
## A1BG-AS1 34.601285 -0.4264891 0.2095279 -2.035477 4.180294e-02
## A1CF 23.336867 -2.2869681 0.5472728 -4.178845 2.929937e-05
## A2M 600.882478 -1.2885536 0.4425605 -2.911588 3.595970e-03
## A2M-AS1 6.710175 -1.3377740 0.4963390 -2.695283 7.032896e-03
## A2ML1 8.768966 -4.3359890 0.6396153 -6.779058 1.209618e-11
## padj
## <numeric>
## A1BG 7.363644e-07
## A1BG-AS1 6.472283e-02
## A1CF 9.823723e-05
## A2M 7.400281e-03
## A2M-AS1 1.343151e-02
## A2ML1 1.181273e-10
table(res$padj < 0.1)
##
## FALSE TRUE
## 6958 15171
A summary of the results can be generated:
summary(res)
##
## out of 23049 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 4329, 19%
## LFC < 0 (down) : 10842, 47%
## outliers [1] : 920, 4%
## low counts [2] : 1, 0.0043%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
For testing at a different threshold, we provide the alpha to results, so that the mean filtering is optimal for our new FDR threshold.
res2 <- results(dds, alpha=0.05)
table(res2$padj < 0.05)
##
## FALSE TRUE
## 8344 13785
resO5rdered <- res2[order(res2$padj),]
write.csv(as.data.frame(resO5rdered),
file="condition_treated_results.csv")
The MA-plot provides a global view of the differential genes, with the log2 fold change on the y-axis over the mean of normalized counts:
plotMA(res, ylim=c(-8,4))
We can also test against a different null hypothesis. For example, to test for genes which have fold change more than doubling or less than halving:
res.thr <- results(dds, lfcThreshold=1)
plotMA(res.thr, ylim=c(-8,4))
A p-value histogram:
hist(res$pvalue[res$baseMean > 1],
col="grey", border="white", xlab="", ylab="", main="")
A sorted results table:
resSort <- res[order(res$padj),]
head(resSort)
## log2 fold change (MAP): Condition Cancer vs Healthy
## Wald test p-value: Condition Cancer vs Healthy
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## POTEE 204.72678 -5.884769 0.2021133 -29.11619 2.239816e-186
## LINC00273 244.37937 -8.109809 0.3464320 -23.40953 3.417835e-121
## MAP1B 580.26255 -5.575665 0.2446174 -22.79341 5.329831e-115
## PDF 87.62892 -3.807874 0.1684273 -22.60842 3.581484e-113
## NUDT3 215.98888 -3.806948 0.1719760 -22.13651 1.407210e-108
## ENDOG 133.27318 -3.452335 0.1638058 -21.07578 1.327090e-98
## padj
## <numeric>
## POTEE 4.956490e-182
## LINC00273 3.781664e-117
## MAP1B 3.931461e-111
## PDF 1.981366e-109
## NUDT3 6.228032e-105
## ENDOG 4.894529e-95
Examine the counts for the top gene, sorting by p-value:
plotCounts(dds, gene=which.min(res$padj), intgroup="Condition")
Make normalized counts plots for the top 9 genes:
par(mfrow=c(3,3))
for (i in 1:9) plotCounts(dds, order(res$padj)[i], intgroup="Condition")
A more sophisticated plot of counts:
library(ggplot2)
data <- plotCounts(dds, gene=which.min(res$padj), intgroup=c("Condition","Label"), returnData=TRUE)
ggplot(data, aes(x=Condition, y=count, col=Label)) +
geom_point(position=position_jitter(width=.1,height=0)) +
scale_y_log10()
Connecting by lines shows the differences which are actually being tested by results given that our design includes cell + Condition
par(mfrow=c(1,1))
ggplot(data, aes(x=Condition, y=count, col=Label, group=Label)) +
geom_point() + geom_line() + scale_y_log10()
A heatmap of the top genes:
library(pheatmap)
topgenes <- head(rownames(resSort),20)
mat <- assay(vsd)[topgenes,]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(dds)[,c("Condition","Label")])
pheatmap(mat, annotation_col=df, fontsize_col=6)
We can then check the annotation of these highly significant genes:
#library(org.Hs.eg.db)
#keytypes(org.Hs.eg.db)
#anno <- select(org.Hs.eg.db, keys=topgenes,
# columns=c("SYMBOL","GENENAME"),
# keytype="ENSEMBL")
#anno[match(topgenes, anno$ENSEMBL),]
#for Bioconductor >= 3.1, easier to use mapIds() function
The contrast argument allows users to specify what results table should be built. See the help and examples in ?results for more details:
#results(dds, contrast=c("cell","N61311","N052611"))
If we suppose that we didn’t know about the different cell-lines in the experiment, but noticed some structure in the counts, we could use surrograte variable analysis (SVA) to detect this hidden structure (see PH525x Course 3 for details on the algorithm).
library(sva)
## Loading required package: mgcv
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:IRanges':
##
## collapse
## This is mgcv 1.8-17. For overview type 'help("mgcv-package")'.
## Loading required package: genefilter
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ Condition, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)
## Number of significant surrogate variables is: 2
## Iteration (out of 5 ):1 2 3 4 5
Do the surrogate variables capture the cell difference?
par(mfrow=c(1,1))
plot(svseq$sv[,1], svseq$sv[,2], col=dds$Label,pch=16)
legend("topright",pch = 16, col=1:5,levels(dds$Label))
#text(svseq$sv[,1], svseq$sv[,2], 1:ncol(dds), pos=1)
Do the surrogate variables capture the health condition?
plot(svseq$sv[,1], svseq$sv[,2], col=dds$Condition,pch=16)
legend("topright",pch = 16, col=1:3,levels(dds$Condition))
#text(svseq$sv[,1], svseq$sv[,2], 1:ncol(dds), pos=1)
Using the surrogate variables in a DESeq2 analysis:
#dds.sva <- dds
#dds.sva$SV1 <- svseq$sv[,1]
#dds.sva$SV2 <- svseq$sv[,2]
#design(dds.sva) <- ~ SV1 + SV2 + Condition
#dds.sva <- DESeq(dds.sva)
sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X Yosemite 10.10.5
##
## locale:
## [1] C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] sva_3.22.0 genefilter_1.56.0
## [3] mgcv_1.8-17 nlme_3.1-131
## [5] pheatmap_1.0.8 ggplot2_2.2.1
## [7] hexbin_1.27.1 vsn_3.42.3
## [9] DESeq2_1.14.1 SummarizedExperiment_1.4.0
## [11] Biobase_2.34.0 GenomicRanges_1.26.4
## [13] GenomeInfoDb_1.10.3 IRanges_2.8.2
## [15] S4Vectors_0.12.2 BiocGenerics_0.20.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.10 locfit_1.5-9.1 lattice_0.20-35
## [4] rprojroot_1.2 digest_0.6.12 plyr_1.8.4
## [7] backports_1.0.5 acepack_1.4.1 RSQLite_1.1-2
## [10] evaluate_0.10 BiocInstaller_1.24.0 zlibbioc_1.20.0
## [13] lazyeval_0.2.0 data.table_1.10.4 annotate_1.52.1
## [16] rpart_4.1-11 Matrix_1.2-10 checkmate_1.8.2
## [19] preprocessCore_1.36.0 rmarkdown_1.5 labeling_0.3
## [22] splines_3.3.3 BiocParallel_1.8.2 geneplotter_1.52.0
## [25] stringr_1.2.0 foreign_0.8-68 htmlwidgets_0.8
## [28] RCurl_1.95-4.8 munsell_0.4.3 base64enc_0.1-3
## [31] htmltools_0.3.6 nnet_7.3-12 tibble_1.3.0
## [34] gridExtra_2.2.1 htmlTable_1.9 Hmisc_4.0-2
## [37] XML_3.98-1.6 bitops_1.0-6 grid_3.3.3
## [40] xtable_1.8-2 gtable_0.2.0 affy_1.52.0
## [43] DBI_0.6-1 magrittr_1.5 scales_0.4.1
## [46] stringi_1.1.5 XVector_0.14.1 affyio_1.44.0
## [49] limma_3.30.13 latticeExtra_0.6-28 Formula_1.2-1
## [52] RColorBrewer_1.1-2 tools_3.3.3 survival_2.41-3
## [55] yaml_2.1.14 AnnotationDbi_1.36.2 colorspace_1.3-2
## [58] cluster_2.0.6 memoise_1.1.0 knitr_1.15.1